R is the name of the programming language itself and RStudio is a convenient interface.
The main goal of this assignment is to introduce you to R and RStudio, which we will be using throughout the course both to learn the statistical concepts discussed in the course and to analyze real data and come to informed conclusions.
Git is a version control system (like “Track Changes” features from Microsoft Word on steroids) and GitHub is the home for your Git-based projects on the internet (like Dropbox but much, much better).
An additional goal is to introduce you to Git and GitHub, which are the collaboration and version control tools that we will be using throughout the course.
As the assignments progress, you are encouraged to explore beyond what the assignments (and labs) dictate; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R. Today we begin with the fundamental building blocks of R and RStudio: the interface, reading in data, and basic commands.
And to make versioning simpler, this is a solo lab. Additionally, we want to make sure everyone gets a significant amount of time at the steering wheel. In future labs we may learn about collaborating on GitHub and produce a single lab report for your team.
As we discussed in class, during this course we will mostly use three tools: R, RStudio and Git/GitHub. First, we need to make sure that these tools are readily available on your machine.
Go to GitHub and create your own profile.
To install Git on your machine use this link. You’ll be asked several questions about what you want to install and various settings. You can keep all the options at their default values.
To install R and RStudio, click on this link and follow the two steps on Posit’s webpage. Step 1 requires you to install R. After you have completed this step, you can proceed with Step 2 and install RStudio.
Each of your assignments will begin with the following steps. You saw these once in class yesterday; they’re outlined in detail here again. Going forward, each lab will start with a “Getting started” section, but the details will be sparser than they are here. You can always refer back to this lab for a detailed list of the steps involved in getting started with an assignment.
Go to the course GitHub page, click on Repositories, and look for the repository associated with the lab or homework that you are reading (in this case, hw-01).
Click on the repository name.
In the top right bar, click on the Fork button. This will create a copy of the repository in your own account.
Select your username as the owner and choose a name for the repository (you can keep the same if you wish).
Now click on the round icon in the top right corner of the page and select Your Repositories. You should now see a list of all your repositories including the one you forked in the step above. Click on it.
Click on the green Clone or download button, select Use HTTPS (this might already be selected by default, and if it is, you’ll see the text Clone with HTTPS as in the image below). Click on the clipboard icon to copy the repo URL.
Go to RStudio and click on File in the main menu. Create a New Project and select the Version Control option. Click on Git and, when prompted, paste the URL you copied from GitHub.
Hit OK, and you’re good to go!
Once you’ve opened up RStudio, we will want to introduce RStudio to Git and set up our personal access token (tokens replaced passwords for Git operations on GitHub in 2021). If you have already done this in the homework, skip ahead to step 6. Our personal access token will connect our project to our GitHub account. This is perhaps the most tedious part of the process, as you often have to (re)introduce yourself to Git when you make a new project.
In the console (lower left panel in RStudio), run the following code:
Note that before Git is connected, the output will show the personal access token as <unset>.
To add a token we can run the following code in the console:
Add a description (e.g., “ECON 422”, the course name) and set the expiration so that the token never expires. This means that it will not need to be updated during the course.
Aside from the description and expiry date, you can leave the other default settings.
Copy and save your generated token by clicking on the clipboard symbol. Keep your token somewhere you can find it later, like in a text file in your course folder.
Note: You will need to “introduce” yourself to GitHub every time you start a new project, so make sure to keep the token in an easily accessible place.
Now we’re ready to set our GitHub credentials.
Now, you’ll notice that it says that the personal access token has been discovered.
Note that for future projects you can skip directly to Step 6. You do not need to generate a personal access token each time.
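The console commands for the steps above can be sketched as follows. This is a minimal sketch, assuming the usethis and gitcreds packages are installed; whether these are exactly the calls used in class is an assumption, and setup_github_pat() is a hypothetical helper name used only to group them.

```r
# Sketch only: assumes the usethis and gitcreds packages are installed.
# setup_github_pat() is a hypothetical helper, not part of either package.
setup_github_pat <- function() {
  # Report the current Git/GitHub setup; before a token is stored,
  # the personal access token is shown as <unset>.
  usethis::git_sitrep()

  # Open the GitHub page for generating a new personal access token.
  usethis::create_github_token()

  # Prompt for the token and save it in the Git credential store.
  gitcreds::gitcreds_set()

  # Check again; the token should now be reported as discovered.
  usethis::git_sitrep()
}

# Run only in an interactive session, since the steps open a browser
# and wait for input.
if (interactive()) setup_github_pat()
```

Each call inside the helper can also be run on its own in the console, one step at a time, which matches how the steps are described.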
Before we introduce the data, let’s warm up with some simple exercises.
The top portion of your R Markdown file (between the two lines of three dashes) is called YAML. It stands for “YAML Ain’t Markup Language”. It is a human-friendly data serialization standard for all programming languages. All you need to know is that this area is called the YAML (we will refer to it as such) and that it contains meta information about your document.
Open the R Markdown (Rmd) file in your project, change the author name to your name, and knit the document.
Then go to the Git pane in your RStudio.
You should see that your Rmd (R Markdown) file and its output, your md (Markdown) file, are listed there as recently changed files.
Next, click on Diff. This will pop open a new window that shows you the difference between the last committed state of the document and its current state that includes your changes. If you’re happy with these changes, click on the checkboxes of all files in the list, and type “Update author name” in the Commit message box and hit Commit.
You don’t have to commit after every change; this would get quite cumbersome. You should consider committing states that are meaningful to you for inspection, comparison, or restoration. Your commits will be the trail of breadcrumbs you leave in your analysis that allow you to retrace your steps. Version control is about tracking who made what change and why. Your GitHub account captures the who; your commit message should capture what was done and, when it is not obvious, also the why. If someone were to rerun the committed code, the message should describe what that code will do.
Your commit messages should follow the guidance from the blog post on the seven rules of a great Git commit message. In this course we will be mainly using rules 3-5, but you might be interested in reading about the others.
In the first few assignments we will tell you exactly when to commit and in some cases, what commit message to use. As the semester progresses we will let you make these decisions.
Thought Exercise: Why follow rules? Why can’t I write whatever I want? There’s nothing stopping you from writing what you want in a commit message. Following shared rules and standards provides consistency and promotes clarity and shared understanding, allowing your future self and collaborators to understand what was done and why.
Now that you have made an update and committed this change, it’s time to push these changes to the web! Or more specifically, to your repo on GitHub. Why? So that others can see your changes. And by others, we mean the course teaching team (your repos in this course are private to you and us, only). In order to push your changes to GitHub, click on Push.
Thought exercise: Which of the above steps (updating the YAML, committing, and pushing) needs to talk to GitHub? Only pushing requires talking to GitHub; this is why you’re asked for your credentials at that point.
R is an open-source language, and developers contribute functionality to R via packages. In this assignment we will use the following packages:
We use the library() function to load packages.
In your R Markdown document you should see an R chunk labelled load-packages which has the necessary code for loading both packages.
You should also load these packages in your Console, which you can do by sending the code to your Console with the Run Current Chunk icon (the green arrow pointing right).
Note that these packages also get loaded in your R Markdown environment when you Knit your R Markdown document.
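A load-packages chunk along these lines would cover everything used below. This is a sketch under an assumption: the two packages are taken to be tidyverse (which supplies %>%, count(), group_by() and slice_max() through dplyr) and openintro (which supplies the data).

```r
# Assumption: the two packages for this assignment are tidyverse and openintro.
library(tidyverse)  # data wrangling and plotting functions, including dplyr
library(openintro)  # data sets, including seattlepets
```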
The city of Seattle, WA has an open data portal that includes pets registered in the city.
For each registered pet, we have information on the pet’s name and species.
The data used in this exercise can be found in the openintro package, and it’s called seattlepets.
Since the dataset is distributed with the package, we don’t need to load it separately; it becomes available to us when we load the package.
You can view the dataset as a spreadsheet using the View() function.
Note that you should not put this function in your R Markdown document, but instead type it directly in the Console, as it pops open a new window (and the concept of popping open a window in a static document doesn’t really make sense…).
When you run this in the console, you’ll see the following data viewer window pop up.
You can find out more about the dataset by inspecting its documentation (which contains a data dictionary, name of each variable and its description), which you can access by running ?seattlepets in the Console or using the Help menu in RStudio to search for seattlepets.
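The commands mentioned above can be collected as follows. This is a small sketch, assuming openintro is installed; the dim() and names() calls are an addition that gives a quick look at the data without opening a window, so they are also safe inside an R Markdown document.

```r
library(openintro)  # makes the seattlepets data frame available

# Quick, non-interactive checks that also work when knitting
dim(seattlepets)    # number of rows and columns
names(seattlepets)  # variable names

# Console-only commands: each opens a separate pane or window
if (interactive()) {
  View(seattlepets)    # spreadsheet-style data viewer
  help("seattlepets")  # same documentation page as ?seattlepets
}
```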
The ✏️ symbol is a reminder to provide a written response discussing the questions in the exercises.
🧶 ✅ ⬆️ Write your answer in your R Markdown document under Exercise 1, knit the document, commit your changes with a commit message that says “Complete Exercise 1”, and push. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
🧶 ✅ ⬆️ Write your answer in your R Markdown document under Exercise 2, knit the document, commit your changes with a commit message that says “Complete Exercise 2”, and push. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
The two lines of code can be read as “Start with the seattlepets data frame, and then count the animal_names, and display the results sorted in descending order.” The “and then” in the previous sentence maps to %>%, the pipe operator, which takes what comes before it and plugs it in as the first argument of the function that comes after it.
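The description above corresponds to a pipeline like the following. This is a sketch, assuming dplyr and openintro are installed; the object names name_counts and same_counts are introduced here only for illustration.

```r
library(dplyr)      # provides %>% and count(); part of the tidyverse
library(openintro)  # provides the seattlepets data frame

# Start with seattlepets, and then count animal_name, sorted by descending n
name_counts <- seattlepets %>%
  count(animal_name, sort = TRUE)
name_counts

# The pipe passes its left-hand side as the first argument of count(),
# so the same call can be written without the pipe
same_counts <- count(seattlepets, animal_name, sort = TRUE)
all.equal(name_counts, same_counts)
```

Comparing the two results is a quick way to check that the piped and unpiped forms do the same thing.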
🧶 ✅ ⬆️ Write your answer in your R Markdown document under Exercise 3. In this exercise you will not only provide a written answer but also include some code and output. You should insert the code in the code chunk provided for you, knit the document to see the output, and then write your narrative for the answer based on the output of this function, and knit again to see your narrative, code, and output in the resulting document. Then, commit your changes with a commit message that says “Complete Exercise 3”, and push. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
Let’s also look to see what the most common pet names are for various species.
For this we need to first group_by() the species, and then do the same counting we did before.
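Grouping first and then counting can be sketched as below, assuming dplyr and openintro are installed; species_name_counts is an illustrative name.

```r
library(dplyr)
library(openintro)

# Group by species, and then count names within each species,
# with the rows sorted by descending count
species_name_counts <- seattlepets %>%
  group_by(species) %>%
  count(animal_name, sort = TRUE)
species_name_counts
```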
Looks like many of those NAs were cats. Poor unnamed kitties…
## # A tibble: 16,823 × 3
## # Groups: species [4]
## species animal_name n
## <chr> <chr> <int>
## 1 Cat <NA> 406
## 2 Dog Lucy 337
## 3 Dog Charlie 306
## 4 Dog Bella 249
## 5 Dog Luna 244
## 6 Dog Daisy 221
## 7 Dog Cooper 189
## 8 Dog Lola 187
## 9 Dog Max 186
## 10 Dog Molly 186
## # ℹ 16,813 more rows
But this output isn’t exactly what we wanted. We wanted to know the most common cat and dog names, but there are barely any cats present in this output! This is because there are more dogs than cats in the dataset overall. We can confirm this by counting the various species in the data.
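Counting the species takes a single count() call. A sketch, assuming dplyr and openintro are installed; species_counts is an illustrative name.

```r
library(dplyr)
library(openintro)

# One row per species, with the most frequent species first
species_counts <- seattlepets %>%
  count(species, sort = TRUE)
species_counts
```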
6 pigs in the city? Ok… But we’ll continue with cats and dogs.
## # A tibble: 4 × 2
## species n
## <chr> <int>
## 1 Dog 35181
## 2 Cat 17294
## 3 Goat 38
## 4 Pig 6
Let’s search for the top 5 cat and dog names.
To do this, we can use the slice_max() function.
The first argument in the function is the variable we want to select the highest values of, which is n.
The second argument is the number of rows to select, which is n = 5 for the top 5.
It may be a bit confusing that both of these are n, but this is because we already have a variable called n in the data frame.
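Putting the two arguments together gives a pipeline like the one below. This is a sketch, assuming dplyr (version 1.0.0 or later, for slice_max()) and openintro are installed; top_names is an illustrative name.

```r
library(dplyr)
library(openintro)

top_names <- seattlepets %>%
  group_by(species) %>%
  count(animal_name, sort = TRUE) %>%
  # first n: the column to rank by; n = 5: how many rows to keep per group
  slice_max(n, n = 5)
top_names
```

Because the data frame is still grouped by species, slice_max() picks the top rows within each species, not overall.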
## # A tibble: 53 × 3
## # Groups: species [4]
## species animal_name n
## <chr> <chr> <int>
## 1 Cat <NA> 406
## 2 Cat Luna 111
## 3 Cat Lucy 102
## 4 Cat Lily 86
## 5 Cat Max 83
## 6 Dog Lucy 337
## 7 Dog Charlie 306
## 8 Dog Bella 249
## 9 Dog Luna 244
## 10 Dog Daisy 221
## # ℹ 43 more rows
%>% print(n = N), where N is the number of rows that you want to print out; 2) Change the slice_max() function such that it returns only the most common name for each species (Questions: How do you do that? Why is more than one name reported for pigs?)
🧶 ✅ ⬆️ Write your answer in your R Markdown document under Exercise 4. In this exercise you’re asked to complete the code provided for you. You should insert the code in the code chunk provided for you, knit the document to see the output, and then write your narrative for the answer based on the output of this function, and knit again to see your narrative, code, and output in the resulting document. Then, commit your changes with a commit message that says “Complete Exercise 4”, and push. Make sure to commit and push all changed files so that your Git pane is cleared up afterwards.
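For the printing part of Exercise 4, the number of rows shown for a tibble is controlled by the n argument of print(). A minimal sketch, assuming dplyr and openintro are installed; the value 20 and the name top_names are chosen here only for illustration.

```r
library(dplyr)
library(openintro)

top_names <- seattlepets %>%
  group_by(species) %>%
  count(animal_name, sort = TRUE) %>%
  slice_max(n, n = 5)

# Name the argument: an unnamed number passed to print() is not
# interpreted as the number of rows
top_names %>% print(n = 20)
```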
The following visualization plots the proportion of dogs with a given name versus the proportion of cats with the same name. The 20 most common cat and dog names are displayed. The diagonal line on the plot is the \(x = y\) line; if a name appeared on this line, the name’s popularity would be exactly the same for dogs and cats.
🧶 ✅ ⬆️ Now is a good time to commit and push your changes to GitHub with an appropriate commit message (Verb first and present tense, e.g. Add X or Complete X). Commit and push all changed files so that your Git pane is cleared up afterwards. Make sure that your last push to the repo comes before the deadline. You should confirm that what you committed and pushed are indeed in your repo that we will see by visiting your repo on GitHub.